Linear Reranking Model for Chinese Pinyin-to-Character Conversion

نویسندگان

  • Xinxin Li
  • Xuan Wang
  • Waqas Anwar
چکیده

Pinyin-to-character conversion is an important task for Chinese natural language processing tasks. Previous work mainly focused on n-gram language models and machine learning approaches, or with additional hand-crafted or automatic rule-based post-processing. There are two problems unable to solve for word n-gram language model: out-of-vocabulary word recognition and long-distance grammatical constraints. In this study, we proposed a linear reranking model trying to solve these problems. Our model uses minimum error learning method to combine different sub models, which includes word and character n-gram LMs, part-of-speech tagging model and dependency model. Impact of different sub models on the conversion are fully experimented and analyzed. Results on the Lancaster Corpus of Mandarin Chinese show that our new model outperforms word n-gram language model.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Exploiting Pinyin Constraints in Pinyin-to-Character Conversion Task: a Class-Based Maximum Entropy Markov Model Approach

The Pinyin-to-Character Conversion task is the core process of the Chinese pinyin-based input method. Statistical language model techniques, especially ngram-based models, are mostly adopted to solve that task. However, the ngram model only focuses on the constraints between characters, ignoring the pinyin constraints in the input pinyin sequence. This paper improves the performance of the Piny...

متن کامل

A Unified Approach to Transliteration-based Text Input with Online Spelling Correction

This paper presents an integrated, end-to-end approach to online spelling correction for text input. Online spelling correction refers to the spelling correction as you type, as opposed to post-editing. The online scenario is particularly important for languages that routinely use transliteration-based text input methods, such as Chinese and Japanese, because the desired target characters canno...

متن کامل

Input Chinese sentences using digits

Chinese character input is always a key issue in a variety of Chinese based applications especially when only a small number keypad is available. Though many kinds of Chinese character encoding schemes are proposed according to Chinese character characteristics, such as the shape, they are not straightforward and will take users a long time to learn. An easy way is to input via Chinese pinyins....

متن کامل

Towards a Semantic Annotation of English Television News - Building and Evaluating a Constraint Grammar FrameNet

This paper introduces a new approach to solve the Chinese Pinyin-to-character (PTC) conversion problem. The conversion from Chinese Pinyin to Chinese character can be regarded as a transformation between two different languages (from the Latin writing system of Chinese Pinyin to the character form of Chinese,Hanzi), which can be naturally solved by machine translation framework. PTC problem is ...

متن کامل

Accurate Pinyin-English Codeswitched Language Identification

Pinyin is the most widely used romanization scheme for Mandarin Chinese. We consider the task of language identification in Pinyin-English codeswitched texts, a task that is significant because of its application to codeswitched text input. We create a codeswitched corpus by extracting and automatically labeling existing Mandarin-English codeswitched corpora. On language identification, we find...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013